Abstract: We study the centroid with respect to the class of information-theoretic Burbea-Rao divergences, which generalize the celebrated Jensen-Shannon divergence by measuring the non-negative Jensen difference induced by a strictly convex and differentiable function. Although Burbea-Rao divergences are symmetric by construction, they are not metrics since they fail to satisfy the triangle inequality. We first explain how a particular symmetrization of Bregman divergences, called Jensen-Bregman distances, yields exactly those Burbea-Rao divergences. We then define skew Burbea-Rao divergences, and show that in limit cases they amount to Bregman divergences. We then prove that Burbea-Rao centroids are unique, and can be approximated arbitrarily finely by a generic iterative concave-convex optimization algorithm with guaranteed convergence. In the second part of the paper, we consider the Bhattacharyya distance, which is commonly used to measure the degree of overlap of probability distributions. We show that Bhattacharyya distances between members of the same statistical exponential family amount to computing a Burbea-Rao divergence in disguise. We thus get an efficient algorithm for computing the Bhattacharyya centroid of a set of parametric distributions belonging to the same exponential family, improving over former specialized methods found in the literature that were limited to univariate or "diagonal" multivariate Gaussians. To illustrate the performance of our Bhattacharyya/Burbea-Rao centroid algorithm, we present experimental results for k-means and hierarchical clustering of Gaussian mixture models.

Index Terms: Centroid, Kullback-Leibler divergence, Jensen-Shannon divergence, Burbea-Rao divergence, Bregman divergences, Exponential families, Bhattacharyya divergence, Information geometry.



I. Introduction 
A. Means and centroids 

In Euclidean geometry, the centroid $c$ of a point set $\mathcal{P} = \{p_1, ..., p_n\}$ is defined as the center of mass $\frac{1}{n}\sum_{i=1}^n p_i$, also characterized as the center point that minimizes the average squared Euclidean distance: $c = \arg\min_p \sum_{i=1}^n \frac{1}{n}\|p - p_i\|^2$. This basic notion of Euclidean centroid can be extended to denote a mean point $M(\mathcal{P})$ representing the centrality of a given point set $\mathcal{P}$. There are basically two complementary approaches to define mean values of numbers: (1) by axiomatization, or (2) by optimization, summarized concisely as follows:

• By axiomatization. This approach was historically pioneered by the independent work of Kolmogorov [1] and Nagumo in 1930, and later simplified and refined by Aczel [3]. Without loss of generality, we consider the mean of two non-negative numbers $x_1$ and $x_2$, and postulate the following expected behaviors of a mean function $M(x_1, x_2)$ as axioms (common sense):

- Reflexivity. $M(x, x) = x$,

- Symmetry. $M(x_1, x_2) = M(x_2, x_1)$,

- Continuity and strict monotonicity. $M(\cdot, \cdot)$ is continuous and $M(x_1, x_2) < M(x_1', x_2)$ for $x_1 < x_1'$, and

- Anonymity. $M(M(x_{11}, x_{12}), M(x_{21}, x_{22})) = M(M(x_{11}, x_{21}), M(x_{12}, x_{22}))$ (also called bisymmetry, expressing the fact that the mean can be computed as a mean on the row means or equivalently as a mean on the column means).

Then one can show that the mean function $M(\cdot, \cdot)$ is necessarily written as:

$$M(x_1, x_2) = f^{-1}\left(\frac{f(x_1) + f(x_2)}{2}\right) \stackrel{\mathrm{def}}{=} M_f(x_1, x_2), \tag{1}$$

for a strictly increasing function $f$. The arithmetic $\frac{x_1 + x_2}{2}$, geometric $\sqrt{x_1 x_2}$ and harmonic $\frac{2}{\frac{1}{x_1} + \frac{1}{x_2}}$ means are instances of such generalized means obtained for $f(x) = x$, $f(x) = \log x$ and $f(x) = \frac{1}{x}$, respectively. Those generalized means are also called quasi-arithmetic means, since they can be interpreted as the arithmetic mean on the sequence $f(x_1), ..., f(x_n)$, the $f$-representation of numbers. To get geometric centroids, we simply consider means on each coordinate axis independently. The Euclidean centroid is thus interpreted as the Euclidean arithmetic mean. Barycenters (weighted centroids) are similarly obtained using non-negative weights (normalized so that $\sum_{i=1}^n w_i = 1$):

$$M_f(x_1, ..., x_n; w_1, ..., w_n) = f^{-1}\left(\sum_{i=1}^n w_i f(x_i)\right). \tag{2}$$

Those generalized means satisfy the inequality property:

$$M_f(x_1, ..., x_n; w_1, ..., w_n) \le M_g(x_1, ..., x_n; w_1, ..., w_n), \tag{3}$$

if and only if function $g$ dominates $f$: that is, $\forall x$, $g(x) > f(x)$. Therefore the arithmetic mean ($f(x) = x$) dominates the geometric mean ($f(x) = \log x$), which in turn dominates the harmonic mean ($f(x) = \frac{1}{x}$). Note that the inequality in Eq. 3 is not strict, as the means coincide for all identical elements: if all $x_i$ are equal to $x$, then $M_f(x_1, ..., x_n) = f^{-1}(f(x)) = x = g^{-1}(g(x)) = M_g(x_1, ..., x_n)$. All those quasi-arithmetic means further satisfy the "interness" property

$$\min(x_1, ..., x_n) \le M_f(x_1, ..., x_n) \le \max(x_1, ..., x_n), \tag{4}$$

derived from the limit cases $p \to \pm\infty$ of power means¹ for $f(x) = x^p$, $p \in \mathbb{R}_* = (-\infty, \infty) \setminus \{0\}$, a non-zero real number.

• By optimization. In this second alternative approach, the barycenter $c$ is defined according to a distance function $d(\cdot, \cdot)$ as the optimal solution of a minimization problem

$$(\mathrm{OPT}): \min_x \sum_{i=1}^n w_i d(x, p_i) = \min_x L(x; \mathcal{P}, d), \tag{5}$$

where the non-negative weights $w_i$ denote multiplicity or relative importance of points (by default, the centroid is defined by fixing all $w_i = \frac{1}{n}$). Ben-Tal et al. [4] considered an information-theoretic class of distances called $f$-divergences [5], [6]:



$$I_f(x, p) = p f\left(\frac{x}{p}\right), \tag{6}$$

for a strictly convex differentiable function $f(\cdot)$ satisfying $f(1) = 0$ and $f'(1) = 0$. Although those $f$-divergences were primarily investigated for probability measures,² we can extend the $f$-divergence to positive measures. Since program (OPT) is strictly convex in $x$, it admits a unique minimizer $M(\mathcal{P}; I_f) = \arg\min_x L(x; \mathcal{P}, I_f)$, termed the entropic mean by Ben-Tal et al. [4]. Interestingly, those entropic means are linear scale-invariant:³

$$M(\lambda p_1, ..., \lambda p_n; I_f) = \lambda M(p_1, ..., p_n; I_f). \tag{7}$$



Nielsen and Nock [7] considered another class of information-theoretic distortion measures $B_F$ called Bregman divergences [8], [9]:

$$B_F(x, p) = F(x) - F(p) - (x - p) F'(p), \tag{8}$$

for a strictly convex differentiable function $F$. It follows that (OPT) is convex, and admits a unique minimizer $M(p_1, ..., p_n; B_F) = M_{F'}(p_1, ..., p_n)$, a quasi-arithmetic mean for the strictly increasing and continuous function $F'$, the derivative of $F$. Observe that information-theoretic distances may be asymmetric (i.e., $d(x, p) \ne d(p, x)$), and therefore one may also define a right-sided centroid $M'$ as the minimizer of

$$(\mathrm{OPT}'): \min_x \sum_{i=1}^n w_i d(p_i, x). \tag{9}$$



It turns out that for $f$-divergences, we have:

$$I_f(x, p) = I_{f^*}(p, x), \tag{10}$$

1 Besides the min/max operators interpreted as extremal power means, the geometric mean itself can also be interpreted as the power mean $\left(\frac{1}{n}\sum_{i=1}^n x_i^p\right)^{1/p}$ in the limit case $p \to 0$.

2 In that context, a $d$-dimensional point is interpreted as a discrete and finite probability measure lying in the $(d-1)$-dimensional unit simplex.

3 That is, means of homogeneous degree 1. 



for $f^*(x) = x f(1/x)$, so that (OPT') is solved as an (OPT) problem for the conjugate function $f^*(\cdot)$. In the same spirit, we have:

$$B_F(x, p) = B_{F^*}(F'(p), F'(x)) \tag{11}$$



for Bregman divergences, where $F^*$ denotes the Legendre convex conjugate.⁴ Surprisingly, although (OPT') may not be convex in $x$ for Bregman divergences (e.g., $F(x) = -\log x$), (OPT') nevertheless admits a unique minimizer, independent of the generator function $F$: the center of mass $M'(\mathcal{P}; B_F) = \sum_{i=1}^n w_i p_i$. Bregman means are not homogeneous except for the power generators $F(x) = x^p$, which yield entropic means, i.e., means that can also be interpreted⁵ as minimizers of average $f$-divergences. Amari [11] further studied those power means (known as $\alpha$-means in information geometry [12]), and showed that they are linear-scale free means obtained as minimizers of $\alpha$-divergences, a proper subclass of $f$-divergences. Nielsen and Nock [13] reported an alternative simpler proof of $\alpha$-means by showing that the $\alpha$-divergences are Bregman divergences in disguise (namely, representational Bregman divergences for positive measures, but not for normalized distribution measures [10]). To get geometric centroids, we simply consider multivariate extensions of the optimization task (OPT). In particular, one may consider separable divergences, that is, divergences that can be assembled coordinate-wise:



$$d(x, p) = \sum_{i=1}^d d_i(x^{(i)}, p^{(i)}), \tag{12}$$

with $x^{(i)}$ denoting the $i$th coordinate. A typical non-separable divergence is the squared Mahalanobis distance [14]:

$$d(x, p) = (x - p)^T Q (x - p), \tag{13}$$

a Bregman divergence called generalized quadratic distance, defined for the generator $F(x) = x^T Q x$, where $Q$ is a positive-definite matrix ($Q \succ 0$). For separable distances, the optimization problem (OPT) may then be reinterpreted as the task of finding the projection [15] of a point $p$ (of dimension $d \times n$) to the upper line $U$:

$$(\mathrm{PROJ}): \inf_{u \in U} d(u, p), \tag{14}$$

with $u_1 = ... = u_{d \times n} > 0$, and $p$ the $(n \times d)$-dimensional point obtained by stacking the $d$ coordinates of each of the $n$ points.
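The two approaches above can be checked numerically on scalars. The following is a minimal sketch, not the authors' implementation: it computes the quasi-arithmetic means of Eq. 2 and the two sided Bregman centroids for the generator $F(x) = x \log x - x$. The helper names and the sample data are hypothetical, chosen only for illustration.

```python
import math

def quasi_arithmetic_mean(xs, f, f_inv, ws=None):
    """Weighted quasi-arithmetic mean M_f = f^{-1}(sum_i w_i f(x_i)) of Eq. 2."""
    if ws is None:
        ws = [1.0 / len(xs)] * len(xs)  # uniform weights give the centroid
    return f_inv(sum(w * f(x) for w, x in zip(ws, xs)))

def arithmetic(xs, ws=None):
    return quasi_arithmetic_mean(xs, lambda x: x, lambda y: y, ws)

def geometric(xs, ws=None):
    return quasi_arithmetic_mean(xs, math.log, math.exp, ws)

def harmonic(xs, ws=None):
    return quasi_arithmetic_mean(xs, lambda x: 1.0 / x, lambda y: 1.0 / y, ws)

def bregman(F, dF, x, p):
    """Univariate Bregman divergence B_F(x, p) of Eq. 8."""
    return F(x) - F(p) - (x - p) * dF(p)

# Hypothetical data: three positive numbers with uniform weights.
xs = [1.0, 2.0, 8.0]
means = (harmonic(xs), geometric(xs), arithmetic(xs))

# Sided Bregman centroids for the generator F(x) = x log x - x, so F'(x) = log x.
F = lambda x: x * math.log(x) - x
dF = math.log
left_centroid = quasi_arithmetic_mean(xs, dF, math.exp)  # minimizes sum_i B_F(x, p_i)
right_centroid = arithmetic(xs)                          # minimizes sum_i B_F(p_i, x)
```

For this generator the left-sided centroid is the geometric mean, while the right-sided centroid is the center of mass whatever the generator.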

In geometry, means (centroids) play a crucial role in center-based clustering (e.g., k-means [16] for vector quantization applications). Indeed, the mean of a cluster allows one to aggregate data into a single center datum. Thus the notion of means is encapsulated into the broader theory of mathematical aggregators [17].

4 Legendre dual convex conjugates $F$ and $F^*$ necessarily have reciprocal gradients: $F^{*\prime} = (F')^{-1}$. See [7].

5 In fact, Amari [10] proved that the intersection of the class of $f$-divergences with the class of Bregman divergences is the class of $\alpha$-divergences.

Results on geometric means can be easily transferred to the field of Statistics [4] by generalizing the optimization problem to a random variable $X$ with distribution $F$ as:

$$(\mathrm{OPT}): \min_x E[X d(x, X)] = \min_x \int_t t\, d(x, t)\, \mathrm{d}F(t), \tag{15}$$

where $E[\cdot]$ denotes the expectation defined with respect to the Lebesgue-Stieltjes integral. Although this approach is discussed in [4] and is important for defining various notions of centrality in statistics, we shall not cover this extended framework here, for the sake of brevity.



B. Burbea-Rao divergences 

In this paper, we focus on the optimization approach (OPT) for defining other (geometric) means using the class of information-theoretic distances obtained by the Jensen difference for a strictly convex and differentiable function $F$:

$$d(x, p) = \frac{F(x) + F(p)}{2} - F\left(\frac{x + p}{2}\right) \stackrel{\mathrm{def}}{=} \mathrm{BR}_F(x, p) \ge 0. \tag{16}$$

Since the underlying differential geometry implied by those Jensen difference distances was seminally studied in papers of Burbea and Rao [18], [19], we shall term them Burbea-Rao divergences, and denote them by $\mathrm{BR}_F$. In the remainder, we consider separable Burbea-Rao divergences. That is, for $d$-dimensional points $p$ and $q$, we define

$$\mathrm{BR}_F(p, q) = \sum_{i=1}^d \mathrm{BR}_F(p^{(i)}, q^{(i)}), \tag{17}$$

and study the Burbea-Rao centroids (and barycenters) as the minimizers of the average Burbea-Rao divergences. Those Burbea-Rao divergences generalize the celebrated Jensen-Shannon divergence

$$\mathrm{JS}(p, q) = H\left(\frac{p + q}{2}\right) - \frac{H(p) + H(q)}{2} \tag{18}$$



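by choosing $F(x) = -H(x)$, the negative of the Shannon entropy $H(x) = -x \log x$. Generators $F(\cdot)$ of parametric distances are convex functions representing entropies, which are concave functions. Burbea-Rao divergences contain all generalized quadratic distances ($F(x) = x^T Q x = \langle Qx, x \rangle$ for a positive definite matrix $Q \succ 0$, also called squared Mahalanobis distances):

$$\begin{aligned}
\mathrm{BR}_F(p, q) &= \frac{F(p) + F(q)}{2} - F\left(\frac{p + q}{2}\right) \\
&= \frac{2\langle Qp, p \rangle + 2\langle Qq, q \rangle - \langle Q(p + q), p + q \rangle}{4} \\
&= \frac{1}{4}\left(\langle Qp, p \rangle + \langle Qq, q \rangle - 2\langle Qp, q \rangle\right) \\
&= \frac{1}{4}\langle Q(p - q), p - q \rangle = \frac{1}{4}\|p - q\|_Q^2.
\end{aligned}$$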
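The quadratic identity above is easy to verify numerically. The sketch below is only an illustration under stated assumptions: the matrix `Q` and the points `p`, `q` are hypothetical values, and `Q` is taken symmetric so that the cross terms combine as in the derivation.

```python
def burbea_rao(F, p, q):
    """Burbea-Rao divergence BR_F(p, q) = (F(p) + F(q))/2 - F((p + q)/2)."""
    mid = [(a + b) / 2.0 for a, b in zip(p, q)]
    return (F(p) + F(q)) / 2.0 - F(mid)

def quadratic_form(Q, x, y):
    """Bilinear form <Qx, y> for a matrix Q given as a list of rows."""
    Qx = [sum(Q[i][j] * x[j] for j in range(len(x))) for i in range(len(Q))]
    return sum(a * b for a, b in zip(Qx, y))

# Hypothetical symmetric positive-definite matrix and points.
Q = [[2.0, 0.5], [0.5, 1.0]]
p = [1.0, -2.0]
q = [0.5, 3.0]

F = lambda x: quadratic_form(Q, x, x)       # generator F(x) = <Qx, x>
lhs = burbea_rao(F, p, q)
diff = [a - b for a, b in zip(p, q)]
rhs = 0.25 * quadratic_form(Q, diff, diff)  # (1/4) ||p - q||_Q^2
```

Both `lhs` and `rhs` evaluate the same quantity, once through the Jensen difference and once through the closed form.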



Although the square root of the Jensen-Shannon divergence yields a metric (a Hilbertian metric), this is not true in general for Burbea-Rao divergences. The closest work to our paper is a 1-page symposium⁶ paper [21] discussing Ali-Silvey-Csiszar $f$-divergences [5], [6] and Bregman divergences [22], [8] (two entropy-based divergence classes). Those information-theoretic distortion classes are compared using quadratic differential metrics, mean values and projections. The notion of skew Jensen differences intervenes in the discussion.

C. Contributions and paper organization 

The paper is articulated into two parts: The first part studies 
the Burbea-Rao centroids, and the second part shows some 
applications in Statistics. We summarize our contributions as 
follows: 

• We define the parametric class of (skew) Burbea-Rao divergences, and show that those divergences naturally arise when generalizing the principle of the Jensen-Shannon divergence [20] to Jensen-Bregman divergences. In the limit cases, we further prove that those skew Burbea-Rao divergences yield asymptotically Bregman divergences.

• We show that the centroids with respect to the (skew) 
Burbea-Rao divergences are unique. Besides centroids 
for special cases of Burbea-Rao divergences (including 
the squared Euclidean distances), those centroids are 
not available in closed-form equations. However, we 
show that any Burbea-Rao centroid can be estimated 
efficiently using an iterative convex-concave optimization 
procedure. As a by-product, we find Bregman sided
centroids [7] in closed form in the extremal skew cases.

We then consider applications of Burbea-Rao centroids in Statistics, and show the link with Bhattacharyya distances. A wide class of statistical parametric models can be handled in a unified manner as exponential families [23]. The classes of exponential families contain many of the standard parametric models including the Poisson, Gaussian, multinomial, and Gamma/Beta distributions, just to name a few prominent members. However, only a few closed-form formulas for the statistical Bhattacharyya distances between those densities are reported in the literature.⁷

For the second part, our contributions are reviewed as 
follows: 

• We show that the (skew) Bhattacharyya distances calculated for distributions belonging to the same exponential family in statistics are equivalent to (skew) Burbea-Rao divergences. We mention the corresponding closed-form formula for computing Chernoff coefficients and $\alpha$-divergences of exponential families. In the limit case, we obtain an alternative proof showing that the Kullback-Leibler divergence of members of the same exponential family is equivalent to a Bregman divergence calculated on the natural parameters [14].

6 In the nineties, the IEEE International Symposium on Information Theory (ISIT) published only 1-page papers. We are grateful to Prof. Michele Basseville for sending us the corresponding slides.

7 For instance, the Bhattacharyya distance between multivariate normal distributions is given in [24].

• We approximate iteratively the Bhattacharyya centroid of any set of distributions of the same exponential family (including multivariate Gaussians) using the Burbea-Rao centroid algorithm. For the case of multivariate Gaussians, we design yet another tailored iterative scheme based on matrix differentials, generalizing the former univariate study of Rigazio et al. [25]. Thus we get either the generic way or the tailored way for computing the Bhattacharyya centroids of arbitrary Gaussians.

• As a field application, we show how to simplify Gaussian mixture models using hierarchical clustering, and show experimentally that the results obtained with the Bhattacharyya centroids compare favorably with former results obtained for Bregman centroids [26]. Our numerical experiments show that the generic method outperforms the alternative tailored method for multivariate Gaussians.

The paper is organized as follows: In Section II, we introduce Burbea-Rao divergences as a natural extension of the Jensen-Shannon divergence using the framework of Bregman divergences. It is followed by Section III, which considers the general case of skew divergences, and reveals the asymptotic behaviors of extreme skew Burbea-Rao divergences as Bregman divergences. Section IV defines the (skew) Burbea-Rao centroids, shows they are unique, and presents a simple iterative algorithm with guaranteed convergence. We then consider applications in Statistics in Section V. After briefly recalling exponential families in Section V-A, we show that Bhattacharyya distances and Chernoff/Amari $\alpha$-divergences are available in closed-form equations as Burbea-Rao divergences for distributions of the same exponential families. Section V-C presents an alternative iterative algorithm tailored to compute the Bhattacharyya centroid of multivariate Gaussians, generalizing the former specialized work of Rigazio et al. [25]. In Section V-D, we use those Bhattacharyya/Burbea-Rao centroids to simplify hierarchically Gaussian mixture models, and comment both qualitatively and quantitatively on our experiments on a color image segmentation application. Finally, Section VI concludes this paper by describing further perspectives and hinting at some information-geometric aspects of this work.

II. Burbea-Rao divergences from symmetrization 
of Bregman divergences 

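Let $\mathbb{R}_+ = [0, +\infty)$ denote the set of non-negative reals. For a strictly convex (and differentiable) generator $F$, we define the Burbea-Rao divergence as the following non-negative function:

$$\mathrm{BR}_F : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+, \qquad (p, q) \mapsto \mathrm{BR}_F(p, q) = \frac{F(p) + F(q)}{2} - F\left(\frac{p + q}{2}\right) \ge 0.$$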



The non-negativity of those divergences follows straightforwardly from Jensen's inequality. Although Burbea-Rao distances are symmetric ($\mathrm{BR}_F(p, q) = \mathrm{BR}_F(q, p)$), they are not metrics since they fail to satisfy the triangle inequality. A geometric interpretation of those divergences is given in Figure 1. Note that $F$ is defined up to an affine term $ax + b$.

Fig. 1. Interpreting the Burbea-Rao divergence $\mathrm{BR}_F(p, q)$ as the vertical distance between the midpoint of segment $[(p, F(p)), (q, F(q))]$ and the midpoint of the graph plot $\left(\frac{p + q}{2}, F\left(\frac{p + q}{2}\right)\right)$.

Fig. 2. Interpreting the Bregman divergence $B_F(p, q)$ as the vertical distance between the tangent plane at $q$ and its translate passing through $p$ (with identical slope $\nabla F(q)$).

We show that Burbea-Rao divergences extend the Jensen-Shannon divergence using the broader concept of Bregman divergences instead of the Kullback-Leibler divergence. A Bregman divergence [22], [8], [9] $B_F$ is defined as the positive tail of the first-order Taylor expansion of a strictly convex and differentiable function $F$:

$$B_F(p, q) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle, \tag{19}$$

where $\nabla F$ denotes the gradient of $F$ (the vector of partial derivatives $\{\frac{\partial F}{\partial x_i}\}_i$), and $\langle x, y \rangle = x^T y$ the inner product (dot product for vectors). A Bregman divergence is interpreted geometrically [14] as the vertical distance between the tangent plane $H_q$ at $q$ of the graph plot $\mathcal{F} = \{\hat{x} = (x, F(x)) \mid x \in \mathcal{X}\}$ and its translate $H_q'$ passing through $\hat{p} = (p, F(p))$. Figure 2 depicts graphically the geometric interpretation of the Bregman divergence (to be compared with the Burbea-Rao divergence in Figure 1).

Bregman divergences are never metrics, and are symmetric only for the generalized quadratic distances [14] obtained by choosing $F(x) = x^T Q x$, for some positive definite matrix $Q \succ 0$. Bregman divergences allow one to encapsulate both statistical distances and geometric distances:

• Kullback-Leibler divergence, obtained for $F(x) = x \log x$:

$$\mathrm{KL}(p, q) = \sum_{i=1}^d p^{(i)} \log \frac{p^{(i)}}{q^{(i)}}, \tag{20}$$

• squared Euclidean distance, obtained for $F(x) = x^2$:

$$L_2^2(p, q) = \sum_{i=1}^d (p^{(i)} - q^{(i)})^2 = \|p - q\|^2. \tag{21}$$
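The two instances above can be recovered from the generic definition of Eq. 19. The following sketch assumes separable generators applied coordinate-wise to vectors stored as Python lists; the function names and the two sample distributions are hypothetical. For normalized distributions the linear terms cancel, so the Bregman divergence of $F(x) = x \log x$ coincides with the Kullback-Leibler divergence.

```python
import math

def bregman(F, grad_F, p, q):
    """B_F(p, q) = F(p) - F(q) - <p - q, grad F(q)> for vectors given as lists."""
    g = grad_F(q)
    return F(p) - F(q) - sum((a - b) * gi for a, b, gi in zip(p, q, g))

# Generator F(x) = sum_i x_i log x_i (separable negative Shannon entropy).
F_ent = lambda x: sum(xi * math.log(xi) for xi in x)
grad_ent = lambda x: [math.log(xi) + 1.0 for xi in x]

# Generator F(x) = sum_i x_i^2.
F_sq = lambda x: sum(xi * xi for xi in x)
grad_sq = lambda x: [2.0 * xi for xi in x]

# Hypothetical discrete distributions on three outcomes.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

kl = sum(a * math.log(a / b) for a, b in zip(p, q))  # Eq. 20
sq = sum((a - b) ** 2 for a, b in zip(p, q))         # Eq. 21
```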



Basically, there are two ways to symmetrize Bregman divergences (see also work on Bregman metrization [27], [28]):

• Jeffreys-Bregman divergences. We consider half of the double-sided divergences:

$$S_F(p; q) = \frac{B_F(p, q) + B_F(q, p)}{2} \tag{22}$$

$$= \frac{1}{2}\langle p - q, \nabla F(p) - \nabla F(q) \rangle. \tag{23}$$

Except for the generalized quadratic distances, this symmetric distance cannot be interpreted as a Bregman divergence [14].

• Jensen-Bregman divergences. We consider the Jeffreys-Bregman divergences from the source parameters to the average parameter as follows:

$$J_F(p; q) = \frac{B_F\left(p, \frac{p + q}{2}\right) + B_F\left(q, \frac{p + q}{2}\right)}{2} \tag{24}$$

$$= \frac{F(p) + F(q)}{2} - F\left(\frac{p + q}{2}\right) = \mathrm{BR}_F(p, q).$$

Note that even for the negative Shannon entropy $F(x) = x \log x - x$ (extended to positive measures), those two symmetrizations yield different divergences: while $S_F$ uses the gradient $\nabla F$, $J_F$ relies only on the generator $F$. Both $J_F$ and $S_F$ always have finite values.⁸ The first symmetrization approach was historically studied by Jeffreys [29].

The second way to symmetrize Bregman divergences generalizes the spirit of the Jensen-Shannon divergence [20]

$$\mathrm{JS}(p, q) = \frac{1}{2}\left(\mathrm{KL}\left(p, \frac{p + q}{2}\right) + \mathrm{KL}\left(q, \frac{p + q}{2}\right)\right) \tag{25}$$

$$= H\left(\frac{p + q}{2}\right) - \frac{H(p) + H(q)}{2}, \tag{26}$$

with non-negativity that can be derived from Jensen's inequality, hence its name. The Jensen-Shannon divergence is also called the total divergence to the average, a generalized measure of diversity from the population distributions $p$ and $q$ to the average population $\frac{p + q}{2}$. Those Jensen difference-type divergences are by definition Burbea-Rao divergences. For the Shannon entropy, those two different information divergence symmetrizations (Jensen-Shannon divergence and Jeffreys $J$ divergence) satisfy the following inequality:

$$J(p, q) \ge 4\, \mathrm{JS}(p, q) \ge 0. \tag{27}$$

Nielsen and Nock [7] investigated the centroids with respect to Jeffreys-Bregman divergences (the symmetrized Kullback-Leibler divergence).

8 This may not be the case for Bregman/Kullback-Leibler divergences, which can potentially be unbounded.

III. Skew Burbea-Rao divergences

We further generalize Burbea-Rao divergences by introducing a positive weight $\alpha \in (0, 1)$ when averaging source parameters $p$ and $q$ as follows:

$$\mathrm{BR}_F^{(\alpha)} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+, \qquad \mathrm{BR}_F^{(\alpha)}(p, q) = \alpha F(p) + (1 - \alpha) F(q) - F(\alpha p + (1 - \alpha) q).$$

We consider the open interval $(0, 1)$ since otherwise the divergence has no discriminatory power (indeed, for $\alpha \in \{0, 1\}$, $\mathrm{BR}_F^{(\alpha)}(p, q) = 0$, $\forall p, q$). Although skewed divergences are asymmetric, $\mathrm{BR}_F^{(\alpha)}(p, q) \ne \mathrm{BR}_F^{(\alpha)}(q, p)$, we can swap arguments by replacing $\alpha$ by $1 - \alpha$:

$$\mathrm{BR}_F^{(\alpha)}(p, q) = \alpha F(p) + (1 - \alpha) F(q) - F(\alpha p + (1 - \alpha) q) = \mathrm{BR}_F^{(1 - \alpha)}(q, p). \tag{28}$$

Those skew Burbea-Rao divergences are similarly found using a skew Jensen-Bregman counterpart (the gradient terms $\nabla F(\alpha p + (1 - \alpha) q)$ perfectly cancel in the sum of skew Bregman divergences):

$$\alpha B_F(p, \alpha p + (1 - \alpha) q) + (1 - \alpha) B_F(q, \alpha p + (1 - \alpha) q) \stackrel{\mathrm{def}}{=} \mathrm{BR}_F^{(\alpha)}(p, q).$$

In the limit cases, $\alpha \to 0$ or $\alpha \to 1$, we have $\mathrm{BR}_F^{(\alpha)}(p, q) \to 0$, $\forall p, q$. That is, those divergences lose their discriminatory power at extremities. However, we show that those skew Burbea-Rao divergences tend asymptotically to Bregman divergences:

$$B_F(p, q) = \lim_{\alpha \to 0} \frac{1}{\alpha} \mathrm{BR}_F^{(\alpha)}(p, q), \tag{29}$$

$$B_F(q, p) = \lim_{\alpha \to 1} \frac{1}{1 - \alpha} \mathrm{BR}_F^{(\alpha)}(p, q). \tag{30}$$

The limit in the right-hand side of Eq. 30 can be expressed alternatively as the following one-sided limit:

$$\lim_{\alpha \uparrow 1} \frac{1}{1 - \alpha} \mathrm{BR}_F^{(\alpha)}(p, q) = \lim_{\alpha \downarrow 0} \frac{1}{\alpha} \mathrm{BR}_F^{(\alpha)}(q, p), \tag{31}$$

where the arrows $\uparrow$ and $\downarrow$ denote the limit from the left and the limit from the right, respectively (see [30] for notations). The right derivative of a function $f$ at $x$ is defined as $f'_+(x) = \lim_{y \downarrow x} \frac{f(y) - f(x)}{y - x}$. Since $\mathrm{BR}_F^{(0)}(p, q) = 0$, $\forall p, q$, it follows that the right-hand-side limit of Eq. 31 is the right derivative (see Theorem 1 of [30] that gives a generalized Taylor expansion of convex functions) of the map

$$L(\alpha) : \alpha \mapsto \mathrm{BR}_F^{(\alpha)}(q, p) \tag{32}$$

taken at $\alpha = 0$. Thus we have

$$\lim_{\alpha \downarrow 0} \frac{1}{\alpha} \mathrm{BR}_F^{(\alpha)}(q, p) = L'_+(0), \tag{33}$$

with

$$L'_+(0) = \frac{\mathrm{d}}{\mathrm{d}\alpha}\left(\alpha F(q) + (1 - \alpha) F(p) - F(\alpha q + (1 - \alpha) p)\right)\Big|_{\alpha = 0}$$

$$= F(q) - F(p) - \langle q - p, \nabla F(p) \rangle \tag{34}$$

$$= B_F(q, p). \tag{35}$$

Lemma 1: Skew Burbea-Rao divergences tend asymptotically to Bregman divergences ($\alpha \to 0$) or reverse Bregman divergences ($\alpha \to 1$).

Thus we may scale skew Burbea-Rao divergences so that Bregman divergences belong to skew Burbea-Rao divergences:

$$\mathrm{sBR}_F^{(\alpha)}(p, q) = \frac{1}{\alpha(1 - \alpha)}\left(\alpha F(p) + (1 - \alpha) F(q) - F(\alpha p + (1 - \alpha) q)\right). \tag{36}$$

Moreover, $\alpha$ is no longer restricted to $(0, 1)$ but ranges over the full real line: $\alpha \in \mathbb{R}$, as also noticed in [31]. Setting $\alpha = \frac{1 - \alpha'}{2}$ (that is, $\alpha' = 1 - 2\alpha$), we get

$$\mathrm{sBR}_F^{(\alpha')}(p, q) = \frac{4}{1 - \alpha'^2}\left(\frac{1 - \alpha'}{2} F(p) + \frac{1 + \alpha'}{2} F(q) - F\left(\frac{1 - \alpha'}{2} p + \frac{1 + \alpha'}{2} q\right)\right). \tag{37}$$
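The limit behavior stated in Lemma 1 can be observed numerically. The sketch below is an illustration only, assuming a univariate generator $F(x) = x \log x - x$ and hypothetical values for `p`, `q` and the small step `eps`; it compares the rescaled skew divergences near $\alpha = 0$ and $\alpha = 1$ with the two Bregman divergences.

```python
import math

def skew_burbea_rao(F, alpha, p, q):
    """BR_F^(alpha)(p, q) = alpha F(p) + (1 - alpha) F(q) - F(alpha p + (1 - alpha) q)."""
    return alpha * F(p) + (1.0 - alpha) * F(q) - F(alpha * p + (1.0 - alpha) * q)

def scaled_skew_burbea_rao(F, alpha, p, q):
    """sBR_F^(alpha)(p, q) of Eq. 36, defined for alpha not in {0, 1}."""
    return skew_burbea_rao(F, alpha, p, q) / (alpha * (1.0 - alpha))

def bregman(F, dF, p, q):
    """Univariate Bregman divergence B_F(p, q)."""
    return F(p) - F(q) - (p - q) * dF(q)

# Hypothetical generator and points (positive scalars).
F = lambda x: x * math.log(x) - x
dF = math.log
p, q = 2.0, 5.0
eps = 1e-6

near_zero = skew_burbea_rao(F, eps, p, q) / eps        # approximates B_F(p, q), Eq. 29
near_one = skew_burbea_rao(F, 1.0 - eps, p, q) / eps   # approximates B_F(q, p), Eq. 30
```

The scaled divergence of Eq. 36 gives the same two limits, since the extra factor $1/(1 - \alpha)$ or $1/\alpha$ tends to 1 at the corresponding extremity.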



IV. Burbea-Rao centroids

Let $\mathcal{P} = \{p_1, ..., p_n\}$ denote a $d$-dimensional point set. To each point, let us further associate a positive weight $w_i$ (accounting for arbitrary multiplicity) and a positive scalar $\alpha_i \in (0, 1)$ to define an anchored distance $\mathrm{BR}_F^{(\alpha_i)}(\cdot, p_i)$. Define the skew Burbea-Rao barycenter (or centroid) $c$ as the minimizer of the following optimization task:

$$(\mathrm{OPT}): c = \arg\min_x \sum_{i=1}^n w_i \mathrm{BR}_F^{(\alpha_i)}(x, p_i) = \arg\min_x L(x). \tag{38}$$

Without loss of generality, we consider argument $x$ in the left argument position (otherwise, we change all $\alpha_i \to 1 - \alpha_i$ to get the right-sided Burbea-Rao centroid). Removing all terms independent of $x$, the minimization program (OPT) amounts to minimizing equivalently the following energy function:

$$E(c) = \left(\sum_{i=1}^n w_i \alpha_i\right) F(c) - \sum_{i=1}^n w_i F(\alpha_i c + (1 - \alpha_i) p_i). \tag{39}$$

Observe that the energy function is decomposable into the sum of a convex function $(\sum_{i=1}^n w_i \alpha_i) F(c)$ and a concave function $-\sum_{i=1}^n w_i F(\alpha_i c + (1 - \alpha_i) p_i)$ (since the sum of $n$ concave functions is concave). We can thus solve iteratively this optimization problem using the Convex-ConCave Procedure [32], [33] (CCCP), by starting from an initial position $c_0$ (say, the barycenter $c_0 = \sum_{i=1}^n w_i p_i$), and iteratively updating the barycenter as follows:

$$\nabla F(c_{t+1}) = \frac{1}{\sum_{i=1}^n w_i \alpha_i} \sum_{i=1}^n w_i \alpha_i \nabla F(\alpha_i c_t + (1 - \alpha_i) p_i), \tag{40}$$

$$c_{t+1} = \nabla F^{-1}\left(\frac{1}{\sum_{i=1}^n w_i \alpha_i} \sum_{i=1}^n w_i \alpha_i \nabla F(\alpha_i c_t + (1 - \alpha_i) p_i)\right). \tag{41}$$

Since $F$ is convex, the second-order derivative $\nabla^2 F$ is always positive definite, and $\nabla F$ is strictly monotone increasing. Thus we can interpret Eq. 41 as a fixed-point equation by considering the $\nabla F$-representation. Each iteration is interpreted as a quasi-arithmetic mean. This proves that the Burbea-Rao centroid is always well-defined and unique, since there is (at most) a unique fixed point for $x = g(x)$ with a function $g(\cdot)$ strictly monotone increasing.

In some cases, like the squared Euclidean distance (or squared Mahalanobis distances), we find closed-form solutions for the Burbea-Rao barycenters. For example, consider the (negative) quadratic entropy $F(x) = \langle x, x \rangle = \sum_{i=1}^d (x^{(i)})^2$ with weights $w_i$ and all $\alpha_i = \frac{1}{2}$ (non-skew symmetric Burbea-Rao divergences). We have:

$$\min_x E(x) = \frac{\langle x, x \rangle}{2} - \frac{1}{4}\sum_{i=1}^n w_i \left(\langle x, x \rangle + 2\langle x, p_i \rangle + \langle p_i, p_i \rangle\right). \tag{42}$$

The minimum is obtained when the gradient $\nabla E(x) = 0$, that is when $x = \bar{p} = \sum_{i=1}^n w_i p_i$, the barycenter of the point set $\mathcal{P}$. For most Burbea-Rao divergences, Eq. 42 can only be solved numerically.

Observe that for extremal skew cases (for $\alpha \to 0$ or $\alpha \to 1$), we obtain the Bregman centroids in closed-form solutions (see Eq. 30). Thus skew Burbea-Rao centroids allow one to get a smooth transition from the right-sided centroid (the center of mass) to the left-sided centroid (a quasi-arithmetic mean $M_f$ obtained for $f = \nabla F$, a continuous and strictly increasing function).

Theorem 1: Skew Burbea-Rao centroids are unique. They can be estimated iteratively using the CCCP iterative algorithm. In extremal skew cases, the Burbea-Rao centroids tend to Bregman left/right sided centroids, and have closed-form equations in limit cases.

To describe the orbit of Burbea-Rao centroids linking the left to right sided Bregman centroids, we compute for $\alpha \in [0, 1]$ the skew Burbea-Rao centroids with the following update scheme:

$$c_{t+1} = \nabla F^{-1}\left(\sum_{i=1}^n w_i \nabla F(\alpha c_t + (1 - \alpha) p_i)\right). \tag{43}$$

We may further consider various convex generators $F_i$ for each point, and consider the updating scheme

$$c_{t+1} = \left(\sum_{i=1}^n w_i \alpha_i \nabla F_i\right)^{-1}\left(\sum_{i=1}^n w_i \alpha_i \nabla F_i(\alpha_i c_t + (1 - \alpha_i) p_i)\right).$$

A. Burbea-Rao divergences of a population 

Consider now the Burbea-Rao divergence of a population $p_1, ..., p_n$ with respective positive normalized weights $w_1, ..., w_n$. The Burbea-Rao divergence is defined by:

$$\mathrm{BR}_F^w(p_1, ..., p_n) = \sum_{i=1}^n w_i F(p_i) - F\left(\sum_{i=1}^n w_i p_i\right) \ge 0. \tag{44}$$

This family of diversity measures includes the Jensen-Renyi divergences [34], [35] for $F(x) = -R_\alpha(x)$, where $R_\alpha(p) = \frac{1}{1 - \alpha} \log \sum_{j=1}^d p_j^\alpha$ is the Renyi entropy of order $\alpha$. (The Renyi entropy is concave for $\alpha \in (0, 1)$ and tends to the Shannon entropy for $\alpha \to 1$.)
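As an illustration of Eq. 44, the following sketch evaluates the Burbea-Rao divergence of a small population of discrete distributions for the negative Shannon entropy generator, which yields a weighted Jensen-Shannon-type diversity. The population, the weights and the function names are hypothetical.

```python
import math

def population_burbea_rao(F, points, weights):
    """BR_F^w(p_1..p_n) = sum_i w_i F(p_i) - F(sum_i w_i p_i) for vector points."""
    dim = len(points[0])
    mean = [sum(w * p[j] for w, p in zip(weights, points)) for j in range(dim)]
    return sum(w * F(p) for w, p in zip(weights, points)) - F(mean)

def neg_shannon(x):
    """F(x) = sum_j x_j log x_j, with the convention 0 log 0 = 0."""
    return sum(xj * math.log(xj) for xj in x if xj > 0.0)

# Hypothetical population of three discrete distributions with normalized weights.
population = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
w = [0.5, 0.25, 0.25]
diversity = population_burbea_rao(neg_shannon, population, w)
```

With two points and weights $(\frac{1}{2}, \frac{1}{2})$, the same function reduces to the pairwise Burbea-Rao divergence of Section II.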

V. Bhattacharyya distances as Burbea-Rao distances

We first briefly recall the versatile class of exponential family distributions in Section V-A. Then we show in Section V-B that the statistical Bhattacharyya/Chernoff distances between exponential family distributions amount to computing a Burbea-Rao divergence.

A. Exponential family distribution in Statistics 

Many usual statistical parametric distributions $p(x; \lambda)$ (e.g., Gaussian, Poisson, Bernoulli/multinomial, Gamma/Beta, etc.) share common properties arising from their common canonical decomposition of probability distribution

$$p(x; \lambda) = p_F(x; \theta) = \exp\left(\langle t(x), \theta \rangle - F(\theta) + k(x)\right). \tag{45}$$

Those distributions⁹ are said to belong to the exponential families (see [23] for a tutorial). An exponential family is characterized by its log-normalizer $F(\theta)$, and a distribution in that family by its natural parameter $\theta$ belonging to the natural space $\Theta$. The log-normalizer $F$ is strictly convex and $C^\infty$, and can also be expressed using the source coordinate system $\lambda$ using the 1-to-1 map $\tau : \Lambda \to \Theta$ that converts parameters from the source coordinate system $\lambda$ to the natural coordinate system $\theta$:

$$F(\theta) = F(\tau(\lambda)) = (F \circ \tau)(\lambda) = F_\lambda(\lambda), \tag{46}$$

where $F_\lambda = F \circ \tau$ denotes the log-normalizer function expressed using the $\lambda$-coordinates instead of the natural $\theta$-coordinates.

The vector $t(x)$ denotes the sufficient statistics, that is, the set of linearly independent functions that allows one to concentrate without any loss all information about the parameter $\theta$ carried in the i.i.d. observations $x_1, x_2, ...$. The inner product $\langle p, q \rangle$ is defined according to the primitive type of $\theta$. Namely, it is a multiplication $\langle p, q \rangle = pq$ for scalars, a dot product $\langle p, q \rangle = p^T q$ for vectors, a matrix trace $\langle p, q \rangle = \mathrm{tr}(p^T \times q) = \mathrm{tr}(p \times q^T)$ for matrices, etc. For composite types such as $p$ being defined by both a vector part and a matrix part, the composite inner product is defined as the sum of inner products on the primitive types. Finally, $k(x)$ represents the carrier measure according to the counting or Lebesgue measures. Decompositions for most common exponential family distributions are given in [23]. An exponential family $\mathcal{E}_F = \{p_F(x; \theta) \mid \theta \in \Theta\}$ is the set of probability distributions obtained for the same log-normalizer function $F$. Information geometry considers $\mathcal{E}_F$ as a manifold entity, and studies its differential geometric properties [12].

9 The distributions can either be discrete or continuous. We do not introduce the unifying framework of probability measures in order not to burden the paper.

For example, consider the family of Poisson distributions 
£p with mass function: 



p(x; A) 



/V 



exp(-A), 



(47) 



for x € N + = NU {0} a positive integer. Poisson distributions 
are univariate exponential families (x € N+) of order 1 
(parameter A). The canonical decomposition yields 

• the sufficient statistic $t(x) = x$,

• $\theta = \log \lambda$, the natural parameter,

• $F(\theta) = \exp \theta$, the log-normalizer,

• and $k(x) = -\log x!$, the carrier measure (with respect to
the counting measure).
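The Poisson decomposition can be checked numerically. The following minimal sketch (the function names are ours, not from the paper) evaluates the mass function of Eq. 47 directly and through the canonical form $\exp(\langle t(x), \theta \rangle - F(\theta) + k(x))$, assuming the canonical form of Eq. 45 has exactly this shape:

```python
import math

def poisson_pmf(x, lam):
    """Poisson mass function of Eq. 47: lam^x exp(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_canonical(x, lam):
    """Same mass function written as exp(<t(x), theta> - F(theta) + k(x))."""
    theta = math.log(lam)        # natural parameter
    t = x                        # sufficient statistic
    F = math.exp(theta)          # log-normalizer
    k = -math.lgamma(x + 1)      # carrier measure, -log x!
    return math.exp(t * theta - F + k)

if __name__ == "__main__":
    for x in range(5):
        print(x, poisson_pmf(x, 3.5), poisson_canonical(x, 3.5))
```

Both columns printed by the usage loop should agree up to floating-point rounding.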

Since we deal with applications using multivariate normals
in the following, we also report explicitly the canonical
decomposition for the multivariate Gaussian family
$\{p_F(x;\theta) \mid \theta \in \Theta\}$. We rewrite the usual Gaussian density
of mean $\mu$ and variance-covariance matrix $\Sigma$:

$$p(x;\lambda) = p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{\det \Sigma}} \exp\left(-\frac{(x-\mu)^T \Sigma^{-1} (x-\mu)}{2}\right) \tag{48}$$



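in the canonical form of Eq. 45 with

• $\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) \in \Theta = \mathbb{R}^d \times K_{d \times d}$, where $K_{d \times d}$
denotes the cone of positive definite matrices,
• $F(\theta) = \frac{1}{4}\mathrm{tr}(\theta_2^{-1}\theta_1\theta_1^T) - \frac{1}{2}\log\det\theta_2 + \frac{d}{2}\log\pi$,
• $t(x) = (x, -x x^T)$,
• $k(x) = 0$.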

In this case, the inner product is composite and is calculated
as the sum of a dot product and a matrix trace as follows:

$$\langle \theta, \theta' \rangle = \theta_1^T \theta_1' + \mathrm{tr}(\theta_2^T \theta_2'). \tag{50}$$



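The coordinate transformation $\tau : \Lambda \to \Theta$ is given for
$\lambda = (\mu, \Sigma)$ by

$$\tau(\lambda) = \left(\Sigma^{-1}\mu, \tfrac{1}{2}\Sigma^{-1}\right), \tag{51}$$

and its inverse mapping $\tau^{-1} : \Theta \to \Lambda$ by

$$\tau^{-1}(\theta) = \left(\tfrac{1}{2}\theta_2^{-1}\theta_1, \tfrac{1}{2}\theta_2^{-1}\right). \tag{52}$$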
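As a sanity check of this decomposition, the sketch below specializes it to $d = 1$, where traces and determinants reduce to scalars and the usual density is $\exp(-(x-\mu)^2/(2\sigma^2))/\sqrt{2\pi\sigma^2}$. The function names are illustrative assumptions; only the formulas for $\tau$, $\tau^{-1}$, $F$ and $t(x)$ come from the text.

```python
import math

def tau(mu, var):
    """Source (mu, sigma^2) -> natural (theta1, theta2) for d = 1 (Eq. 51)."""
    return mu / var, 1.0 / (2.0 * var)

def tau_inv(theta1, theta2):
    """Natural -> source parameters for d = 1 (Eq. 52)."""
    return theta1 / (2.0 * theta2), 1.0 / (2.0 * theta2)

def log_normalizer(theta1, theta2):
    """F(theta) with d = 1: theta1^2 / (4 theta2) - log(theta2)/2 + log(pi)/2."""
    return theta1 ** 2 / (4.0 * theta2) - 0.5 * math.log(theta2) + 0.5 * math.log(math.pi)

def density_canonical(x, mu, var):
    """exp(<theta, t(x)> - F(theta)) with t(x) = (x, -x^2) and k(x) = 0."""
    theta1, theta2 = tau(mu, var)
    inner = theta1 * x + theta2 * (-x * x)
    return math.exp(inner - log_normalizer(theta1, theta2))

def density_usual(x, mu, var):
    """Usual univariate normal density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
```

Evaluating `density_canonical` and `density_usual` at the same point should give the same value, and `tau_inv(*tau(mu, var))` should return the original source parameters.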
B. Bhattacharyya/Chernoff coefficients and $\alpha$-divergences as
skew Burbea-Rao divergences

For arbitrary probability distributions $p(x)$ and $q(x)$ (parametric
or not), we measure the amount of overlap between
those distributions using the Bhattacharyya coefficient [36]:

$$C(p, q) = \int \sqrt{p(x)q(x)}\,dx. \tag{53}$$



Clearly, the Bhattacharyya coefficient (measuring the affinity
between distributions [37]) falls in the unit range:

$$0 \leq C(p,q) \leq 1. \tag{54}$$



In fact, we may interpret this coefficient geometrically by considering
$\sqrt{p(x)}$ and $\sqrt{q(x)}$ as unit vectors. The Bhattacharyya
coefficient is then the dot product, representing the cosine of
the angle made by the two unit vectors. The Bhattacharyya
distance $B : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$ is derived from its coefficient [36]
as

$$B(p,q) = -\ln C(p,q). \tag{55}$$
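For discrete distributions the integral of Eq. 53 becomes a finite sum, which gives a compact illustration of Eqs. 53 to 55. This is a minimal sketch with hypothetical function names; it assumes both inputs are probability vectors of equal length with overlapping support (otherwise the coefficient is zero and the logarithm is undefined).

```python
import math

def bhattacharyya_coefficient(p, q):
    """Discrete analogue of Eq. 53: sum over x of sqrt(p(x) q(x))."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def bhattacharyya_distance(p, q):
    """Eq. 55: B(p, q) = -ln C(p, q)."""
    return -math.log(bhattacharyya_coefficient(p, q))

if __name__ == "__main__":
    p = [0.9, 0.1]
    q = [0.1, 0.9]
    print(bhattacharyya_coefficient(p, q))  # overlap, within [0, 1]
    print(bhattacharyya_distance(p, q))     # non-negative
```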



The Bhattacharyya distance allows one to both upper and
lower bound Bayes' classification error [38], [39],
while there are no such results for the symmetric
Kullback-Leibler divergence. Both the Bhattacharyya distance and the
symmetric Kullback-Leibler divergence agree with the Fisher
information at the infinitesimal level. Although the
Bhattacharyya distance is symmetric, it is not a metric.
Nevertheless, it can be metrized by transforming it into the following
Hellinger metric [40]:

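$$H(p, q) = \sqrt{\frac{1}{2}\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx}, \tag{56}$$

such that $0 \leq H(p, q) \leq 1$. It follows that

$$H(p,q) = \sqrt{\frac{1}{2}\left(\int p(x)\,dx + \int q(x)\,dx - 2\int \sqrt{p(x)}\sqrt{q(x)}\,dx\right)} = \sqrt{1 - C(p,q)}. \tag{57}$$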

The Hellinger metric is also called the Matusita metric [37] in the
literature. The thesis of Hellinger was emphasized in the work
of Kakutani [41].
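The two claims above (the Bhattacharyya distance is not a metric, while the Hellinger transform of its coefficient is) can be illustrated on three discrete distributions. The sketch below uses hypothetical helper names and hand-picked example values; it checks the identity of Eq. 57 and exhibits one triple for which the Bhattacharyya distance violates the triangle inequality while the Hellinger metric does not.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Discrete Bhattacharyya coefficient (Eq. 53 with a sum)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def bhattacharyya_distance(p, q):
    """B(p, q) = -ln C(p, q) (Eq. 55)."""
    return -math.log(bhattacharyya_coefficient(p, q))

def hellinger(p, q):
    """Discrete Hellinger metric (Eq. 56 with a sum)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Example triple (illustrative values): q sits "between" p and r.
p = [0.99, 0.01]
q = [0.50, 0.50]
r = [0.01, 0.99]

# Triangle inequality fails for B but holds for H on this triple.
b_violates = bhattacharyya_distance(p, r) > bhattacharyya_distance(p, q) + bhattacharyya_distance(q, r)
h_holds = hellinger(p, r) <= hellinger(p, q) + hellinger(q, r)
```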

We consider a direct generalization of Bhattacharyya coefficients
and divergences called Chernoff divergences (footnote 10):

\begin{align}
B_\alpha(p,q) &= -\ln \int_x p^\alpha(x) q^{1-\alpha}(x)\,dx = -\ln C_\alpha(p,q) \tag{59}\\
&= -\ln E_q[L^\alpha(x)] \tag{60}
\end{align}

defined for some $\alpha \in (0, 1)$ (the Bhattacharyya divergence
is obtained for $\alpha = \frac{1}{2}$), where $E[\cdot]$ denotes the expectation,
and $L(x) = \frac{p(x)}{q(x)}$ the likelihood ratio. The term
$\int_x p^\alpha(x) q^{1-\alpha}(x)\,dx$ is called the Chernoff coefficient. The
Bhattacharyya/Chernoff distance of members of the same
exponential family yields a weighted asymmetric Burbea-Rao
divergence (namely, a skew Burbea-Rao divergence):

$$B_\alpha(p_F(x;\theta_p), p_F(x;\theta_q)) = BR_F^{(\alpha)}(\theta_p, \theta_q) \tag{61}$$

with

$$BR_F^{(\alpha)}(\theta_p, \theta_q) = \alpha F(\theta_p) + (1-\alpha)F(\theta_q) - F(\alpha\theta_p + (1-\alpha)\theta_q). \tag{62}$$

10 In the literature, Chernoff information is also defined
as $-\log \inf_{\alpha \in [0,1]} \int p^\alpha(x) q^{1-\alpha}(x)\,dx$. Similarly, Chernoff
coefficients $C_\alpha(p, q)$ are defined as the supremum:
$C_\alpha(p, q) = \sup_{\alpha \in [0,1]} \int p^\alpha(x) q^{1-\alpha}(x)\,dx$.
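The skew Burbea-Rao divergence of Eq. 62 only needs the log-normalizer $F$. A minimal sketch for scalar natural parameters follows; the function name and the Poisson example values are ours. With $F = \exp$ and $\theta = \log\lambda$ (the Poisson family), the divergence reduces to $\alpha\lambda_p + (1-\alpha)\lambda_q - \lambda_p^{\alpha}\lambda_q^{1-\alpha}$.

```python
import math

def skew_burbea_rao(F, theta_p, theta_q, alpha):
    """BR_F^(alpha)(theta_p, theta_q) of Eq. 62 for scalar natural parameters."""
    mix = alpha * theta_p + (1.0 - alpha) * theta_q
    return alpha * F(theta_p) + (1.0 - alpha) * F(theta_q) - F(mix)

# Poisson example: F = exp, theta = log(lambda), with lambda_p = 2 and lambda_q = 8.
theta_p, theta_q = math.log(2.0), math.log(8.0)
symmetric = skew_burbea_rao(math.exp, theta_p, theta_q, 0.5)   # (2 + 8)/2 - sqrt(16)
skew_pq = skew_burbea_rao(math.exp, theta_p, theta_q, 0.25)
skew_qp = skew_burbea_rao(math.exp, theta_q, theta_p, 0.25)    # differs from skew_pq
```

For $\alpha = \frac{1}{2}$ the divergence is symmetric in its arguments; for any other $\alpha$ swapping the arguments generally changes the value, which is why the skew divergence is called asymmetric.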

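Chernoff coefficients are also related to $\alpha$-divergences, the
canonical divergences in $\alpha$-flat spaces in information
geometry [12] (p. 57):

$$D_\alpha(p\|q) = \begin{cases}
\frac{4}{1-\alpha^2}\left(1 - \int p(x)^{\frac{1-\alpha}{2}} q(x)^{\frac{1+\alpha}{2}}\,dx\right), & \alpha \neq \pm 1, \\
\int p(x)\log\frac{p(x)}{q(x)}\,dx = KL(p,q), & \alpha = -1, \\
\int q(x)\log\frac{q(x)}{p(x)}\,dx = KL(q,p), & \alpha = 1.
\end{cases} \tag{63}$$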



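The class of $\alpha$-divergences satisfies the following reference
duality: $D_\alpha(p\|q) = D_{-\alpha}(q\|p)$. Remapping $\alpha' = \frac{1-\alpha}{2}$
(that is, $\alpha = 1 - 2\alpha'$), we transform Amari $\alpha$-divergences to
Chernoff $\alpha'$-divergences (footnote 11):

$$D_{\alpha'}(p,q) = \begin{cases}
\frac{1}{\alpha'(1-\alpha')}\left(1 - \int p(x)^{\alpha'} q(x)^{1-\alpha'}\,dx\right), & \alpha' \notin \{0,1\}, \\
\int p(x)\log\frac{p(x)}{q(x)}\,dx = KL(p,q), & \alpha' = 1, \\
\int q(x)\log\frac{q(x)}{p(x)}\,dx = KL(q,p), & \alpha' = 0.
\end{cases} \tag{64}$$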



Theorem 2: The Chernoff $\alpha'$-divergence ($\alpha \neq \pm 1$) of
distributions belonging to the same exponential family is
given in closed form by means of a skewed Burbea-Rao
divergence as: $D_{\alpha'}(p,q) = \frac{1}{\alpha'(1-\alpha')}\left(1 - e^{-BR_F^{(\alpha')}(\theta_p,\theta_q)}\right)$, with
$BR_F^{(\alpha)}(\theta_p, \theta_q) = (\alpha F(\theta_p) + (1-\alpha)F(\theta_q)) - F(\alpha\theta_p + (1-\alpha)\theta_q)$.
The Amari $\alpha$-divergence for members of the same
exponential family amounts to computing
$D_\alpha(p,q) = \frac{4}{1-\alpha^2}\left(1 - e^{-BR_F^{(\frac{1-\alpha}{2})}(\theta_p,\theta_q)}\right)$.
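Theorem 2 can be checked numerically on the Poisson family ($F = \exp$, $\theta = \log\lambda$). The sketch below compares the closed form with a truncated direct summation of the Chernoff coefficient, and also evaluates the closed form near $\alpha' = 1$, where it should approach $KL(p,q)$. The function names, the truncation at 200 terms and the Poisson Kullback-Leibler formula $\lambda_q - \lambda_p + \lambda_p\log(\lambda_p/\lambda_q)$ are our assumptions (the last one is the standard closed form).

```python
import math

def poisson_pmf(x, lam):
    """Poisson mass function, computed in the log domain for stability."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def chernoff_divergence_direct(lam_p, lam_q, a, terms=200):
    """(1 - sum_x p(x)^a q(x)^(1 - a)) / (a (1 - a)), with a truncated sum."""
    c = sum(poisson_pmf(x, lam_p) ** a * poisson_pmf(x, lam_q) ** (1.0 - a)
            for x in range(terms))
    return (1.0 - c) / (a * (1.0 - a))

def chernoff_divergence_closed(lam_p, lam_q, a):
    """Closed form of Theorem 2 with F = exp and theta = log(lambda)."""
    tp, tq = math.log(lam_p), math.log(lam_q)
    br = a * math.exp(tp) + (1.0 - a) * math.exp(tq) - math.exp(a * tp + (1.0 - a) * tq)
    return (1.0 - math.exp(-br)) / (a * (1.0 - a))

def kl_poisson(lam_p, lam_q):
    """KL(p, q) between two Poisson distributions (standard closed form)."""
    return lam_q - lam_p + lam_p * math.log(lam_p / lam_q)
```

For example, `chernoff_divergence_closed(2.0, 5.0, 0.999)` should be close to `kl_poisson(2.0, 5.0)`, in line with the limit case $\alpha' = 1$ of Eq. 64.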

We get the following theorem for Bhattacharyya/Chernoff
distances:

Theorem 3: The skew Bhattacharyya divergence
$B_\alpha(p,q)$ is equivalent to the Burbea-Rao divergence
for members of the same exponential family
$\mathcal{E}_F$: $B_\alpha(p,q) = B_\alpha(p_F(x;\theta_p), p_F(x;\theta_q))
= -\log C_\alpha(p_F(x;\theta_p), p_F(x;\theta_q)) = BR_F^{(\alpha)}(\theta_p, \theta_q)$.

In particular, for $\alpha = \pm 1$, the Kullback-Leibler divergence
of those exponential family distributions amounts to computing


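11 Chernoff coefficients are also related to the Rényi $\alpha$-divergence generalizing
the Kullback-Leibler divergence: $R_\alpha(p\|q) = \frac{1}{\alpha-1}\log\int_x p(x)^\alpha q^{1-\alpha}(x)\,dx$,
built on the Rényi entropy $H_R^\alpha(p) = \frac{1}{1-\alpha}\log\int_x p^\alpha(x)\,dx$. The Tsallis
entropy $H_T^\alpha(p) = \frac{1}{\alpha-1}\left(1 - \int p(x)^\alpha\,dx\right)$ can also be obtained from the Rényi
entropy (and vice-versa) via the mappings: $H_T^\alpha(p) = \frac{1}{1-\alpha}\left(e^{(1-\alpha)H_R^\alpha(p)} - 1\right)$
and $H_R^\alpha(p) = \frac{1}{1-\alpha}\log\left(1 + (1-\alpha)H_T^\alpha(p)\right)$.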
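Let us compute the Chernoff coefficient for distributions belonging to the same exponential families. Without loss of generality,
let us consider the reduced canonical form of exponential families $p_F(x;\theta) = \exp(\langle x, \theta\rangle - F(\theta))$. Chernoff coefficients $C_\alpha(p, q)$
of members $p = p_F(x;\theta_p)$ and $q = p_F(x;\theta_q)$ of the same exponential family $\mathcal{E}_F$:

\begin{align*}
C_\alpha(p,q) &= \int p^\alpha(x) q^{1-\alpha}(x)\,dx = \int p_F^\alpha(x;\theta_p)\, p_F^{1-\alpha}(x;\theta_q)\,dx \\
&= \int \exp\left(\alpha(\langle x, \theta_p\rangle - F(\theta_p))\right) \times \exp\left((1-\alpha)(\langle x, \theta_q\rangle - F(\theta_q))\right) dx \\
&= \int \exp\left(\langle x, \alpha\theta_p + (1-\alpha)\theta_q\rangle - (\alpha F(\theta_p) + (1-\alpha)F(\theta_q))\right) dx \\
&= \exp\left(-(\alpha F(\theta_p) + (1-\alpha)F(\theta_q))\right) \times \int \exp\left(\langle x, \alpha\theta_p + (1-\alpha)\theta_q\rangle - F(\alpha\theta_p + (1-\alpha)\theta_q) + F(\alpha\theta_p + (1-\alpha)\theta_q)\right) dx \\
&= \exp\left(F(\alpha\theta_p + (1-\alpha)\theta_q) - (\alpha F(\theta_p) + (1-\alpha)F(\theta_q))\right) \times \int \exp\left(\langle x, \alpha\theta_p + (1-\alpha)\theta_q\rangle - F(\alpha\theta_p + (1-\alpha)\theta_q)\right) dx \\
&= \exp\left(F(\alpha\theta_p + (1-\alpha)\theta_q) - (\alpha F(\theta_p) + (1-\alpha)F(\theta_q))\right) \times \underbrace{\int p_F(x; \alpha\theta_p + (1-\alpha)\theta_q)\,dx}_{=1} \\
&= \exp\left(-BR_F^{(\alpha)}(\theta_p, \theta_q)\right) > 0.
\end{align*}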
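The identity $C_\alpha(p,q) = \exp(-BR_F^{(\alpha)}(\theta_p,\theta_q))$ derived above can be verified exactly on a family with finite support. The sketch below uses the Bernoulli family as an illustrative choice of ours, with natural parameter $\theta = \log\frac{p}{1-p}$ and log-normalizer $F(\theta) = \log(1 + e^\theta)$; the function names are hypothetical.

```python
import math

def F(theta):
    """Log-normalizer of the Bernoulli family, theta = log(p / (1 - p))."""
    return math.log1p(math.exp(theta))

def chernoff_coefficient_direct(p, q, alpha):
    """Sum of p(x)^alpha q(x)^(1 - alpha) over the two outcomes x in {0, 1}."""
    return p ** alpha * q ** (1.0 - alpha) + (1.0 - p) ** alpha * (1.0 - q) ** (1.0 - alpha)

def chernoff_coefficient_closed(p, q, alpha):
    """exp(-BR_F^(alpha)(theta_p, theta_q)) computed on the natural parameters."""
    tp = math.log(p / (1.0 - p))
    tq = math.log(q / (1.0 - q))
    br = alpha * F(tp) + (1.0 - alpha) * F(tq) - F(alpha * tp + (1.0 - alpha) * tq)
    return math.exp(-br)
```

Both functions should return the same value for any success probabilities strictly between 0 and 1 and any $\alpha \in (0,1)$.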



a Bregman divergence [14] (by taking the limit as $\alpha \to 1$ or
$\alpha \to 0$).

Corollary 1: In the limit case $\alpha' \in \{0, 1\}$, the $\alpha'$-divergences
amount to computing a Kullback-Leibler divergence, which is
equivalent to computing a Bregman divergence for the
log-normalizer on the swapped natural parameters:
$KL(p_F(x;\theta_p), p_F(x;\theta_q)) = B_F(\theta_q, \theta_p)$.

Proof: The proof relies on the equivalence of Burbea-Rao
divergences to Bregman divergences for extremal values
of $\alpha \in \{0, 1\}$.

\begin{align}
KL(p,q) &= KL(p_F(x;\theta_p), p_F(x;\theta_q)) \tag{65}\\
&= \lim_{\alpha' \to 1} D_{\alpha'}(p_F(x;\theta_p), p_F(x;\theta_q)) \tag{66}\\
&= \lim_{\alpha' \to 1} \frac{1}{\alpha'(1-\alpha')}\left(1 - C_{\alpha'}(p_F(x;\theta_p), p_F(x;\theta_q))\right) \notag\\
&= \lim_{\alpha' \to 1} \frac{1}{\alpha'(1-\alpha')} BR_F^{(\alpha')}(\theta_p, \theta_q) \quad (\text{since } \exp x \simeq 1 + x \text{ as } x \to 0) \tag{67}\\
&= \lim_{\alpha' \to 1} \frac{1}{\alpha'(1-\alpha')} (1-\alpha') B_F(\theta_q, \theta_p) \notag\\
&= \lim_{\alpha' \to 1} \frac{1}{\alpha'} B_F(\theta_q, \theta_p) = B_F(\theta_q, \theta_p) \tag{68}
\end{align}

Similarly, we have $\lim_{\alpha' \to 0} D_{\alpha'}(p_F(x;\theta_p), p_F(x;\theta_q)) =
KL(p_F(x;\theta_q), p_F(x;\theta_p)) = B_F(\theta_p, \theta_q)$. ■

Table I reports the Bhattacharyya distances for members of
the same exponential families.

C. Direct method for calculating the Bhattacharyya centroids
of multivariate normals

To the best of our knowledge, the Bhattacharyya centroid
has only been studied for univariate Gaussian or diagonal
multivariate Gaussian distributions [42] in the context of
speech recognition, where it is reported that it can be estimated
using an iterative algorithm (no convergence guarantees are
reported in [42]).

In order to compare this scheme on multivariate data with
our generic Burbea-Rao scheme, we extend the approach of
Rigazio et al. [42] to multivariate Gaussians. Plugging the
Bhattacharyya distance of Gaussians in the energy function
of the optimization problem (OPT), we get

$$L(c) = \sum_{i=1}^n \frac{1}{8}(\mu_c - \mu_i)^T \left(\frac{\Sigma_c + \Sigma_i}{2}\right)^{-1} (\mu_c - \mu_i) + \frac{1}{2}\log\left(\frac{\det\left(\frac{\Sigma_c + \Sigma_i}{2}\right)}{\sqrt{\det\Sigma_c \det\Sigma_i}}\right). \tag{69}$$

This is equivalent to minimizing the following energy:

$$F(c) = \sum_{i=1}^n (\mu_c - \mu_i)^T (\Sigma_c + \Sigma_i)^{-1} (\mu_c - \mu_i) + 2\log\left(\det(\Sigma_c + \Sigma_i)\right) - \log\left(\det\Sigma_c\right) - \log\left(2^{2d}\det\Sigma_i\right). \tag{70}$$

In order to minimize $F(c)$, let us differentiate with respect to
$\mu_c$. Let $U_i$ denote $(\Sigma_c + \Sigma_i)^{-1}$. Using matrix differentials [43]
(p. 10, Eq. 73), we get (writing $L$ for the rescaled energy of Eq. 70):

$$\frac{\partial L}{\partial \mu_c} = \sum_{i=1}^n \left[U_i + U_i^T\right](\mu_c - \mu_i). \tag{71}$$

Then one can estimate $\mu_c$ iteratively, since $U_i$ depends on
$\Sigma_c$ which is unknown. We update $\mu_c$ as follows:

$$\mu_c^{(t+1)} = \left[\sum_{i=1}^n \left[U_i + U_i^T\right]\right]^{-1} \left[\sum_{i=1}^n \left[U_i + U_i^T\right]\mu_i\right]. \tag{72}$$

Now let us estimate $\Sigma_c$. We used matrix differentials [43] (p. 9,
Eq. 55 for the first term, and p. 8, Eq. 51 for the two others):

$$\frac{\partial L}{\partial \Sigma_c} = -\sum_{i=1}^n U_i^T (\mu_c - \mu_i)(\mu_c - \mu_i)^T U_i^T + 2\sum_{i=1}^n U_i^T - n\Sigma_c^{-T}. \tag{73}$$
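Exponential family | $\tau : \Lambda \to \Theta$ | $F(\theta)$ (up to a constant) | Bhattacharyya/Burbea-Rao $BR_F(\lambda_p, \lambda_q) = BR_F(\tau(\lambda_p), \tau(\lambda_q))$
Multinomial | $\theta_i = \log\frac{p_i}{p_k}$ | $\log\left(1 + \sum_{i=1}^{k-1}\exp\theta_i\right)$ | $-\ln\sum_{i=1}^{k}\sqrt{p_i q_i}$
Poisson | $\theta = \log\lambda$ | $\exp\theta$ | $\frac{1}{2}\left(\sqrt{\lambda_p} - \sqrt{\lambda_q}\right)^2$
Gaussian | $\theta = (\theta_1, \theta_2) = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$ | $-\frac{\theta_1^2}{4\theta_2} + \frac{1}{2}\log\left(-\frac{\pi}{\theta_2}\right)$ | $\frac{1}{4}\frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2} + \frac{1}{2}\ln\frac{\sigma_p^2 + \sigma_q^2}{2\sigma_p\sigma_q}$
Multivariate Gaussian | $\theta = \left(\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}\right)$ | $\frac{1}{4}\mathrm{tr}(\theta_2^{-1}\theta_1\theta_1^T) - \frac{1}{2}\log\det\theta_2$ | $\frac{1}{8}(\mu_p - \mu_q)^T\left(\frac{\Sigma_p + \Sigma_q}{2}\right)^{-1}(\mu_p - \mu_q) + \frac{1}{2}\ln\frac{\det\frac{\Sigma_p + \Sigma_q}{2}}{\sqrt{\det\Sigma_p\det\Sigma_q}}$

TABLE I

Closed-form Bhattacharyya distances for some classes of exponential families (expressed in source parameters for ease of use).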
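The univariate Gaussian row of Table I can be cross-checked against the Burbea-Rao divergence evaluated on natural parameters. The sketch below assumes the standard univariate parameterization $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$ with $F(\theta) = -\theta_1^2/(4\theta_2) + \frac{1}{2}\log(-\pi/\theta_2)$; the function names and test values are ours.

```python
import math

def natural(mu, var):
    """Source (mu, sigma^2) -> natural parameters of the univariate Gaussian."""
    return mu / var, -1.0 / (2.0 * var)

def F(theta):
    """Log-normalizer of the univariate Gaussian in natural coordinates."""
    t1, t2 = theta
    return -t1 * t1 / (4.0 * t2) + 0.5 * math.log(-math.pi / t2)

def burbea_rao(theta_p, theta_q):
    """Symmetric Burbea-Rao divergence (alpha = 1/2) induced by F."""
    mid = tuple(0.5 * (a + b) for a, b in zip(theta_p, theta_q))
    return 0.5 * (F(theta_p) + F(theta_q)) - F(mid)

def bhattacharyya_gaussian(mu_p, var_p, mu_q, var_q):
    """Closed form in source parameters (Gaussian row of Table I)."""
    return (0.25 * (mu_p - mu_q) ** 2 / (var_p + var_q)
            + 0.5 * math.log((var_p + var_q) / (2.0 * math.sqrt(var_p * var_q))))
```

Calling `burbea_rao(natural(mu_p, var_p), natural(mu_q, var_q))` should reproduce `bhattacharyya_gaussian(mu_p, var_p, mu_q, var_q)` up to rounding.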



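Taking into account the fact that $\Sigma_c$ is symmetric, differential
calculus on symmetric matrices simply gives:

$$\frac{dL}{d\Sigma_c} = \frac{\partial L}{\partial\Sigma_c} + \left[\frac{\partial L}{\partial\Sigma_c}\right]^T - \mathrm{diag}\left(\frac{\partial L}{\partial\Sigma_c}\right). \tag{74}$$

Thus, if one notes $A$ as:

$$A = \sum_{i=1}^n \left(2U_i^T - U_i^T(\mu_c - \mu_i)(\mu_c - \mu_i)^T U_i^T\right), \tag{75}$$

and recalling that $\Sigma_c$ is symmetric, one has to solve

$$n\left(2\Sigma_c^{-1} - \mathrm{diag}(\Sigma_c^{-1})\right) = A + A^T - \mathrm{diag}(A). \tag{76}$$

Let

$$B = A + A^T - \mathrm{diag}(A). \tag{77}$$

Then one can estimate $\Sigma_c$ iteratively as follows:

$$\Sigma_c^{(k+1)} = 2n\left(B^{(k)} + \mathrm{diag}(B^{(k)})\right)^{-1}. \tag{78}$$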
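To make the fixed-point scheme concrete, here is a minimal univariate sketch. It is our own specialization, not the multivariate implementation: for $d = 1$, $U_i = 1/(\sigma_c^2 + \sigma_i^2)$, the mean update is a $U_i$-weighted average, and the stationarity condition of the energy reduces to $\sigma_c^2 = n/A$ with $A = \sum_i \left(2U_i - U_i^2(\mu_c - \mu_i)^2\right)$. The function name, the initialization by arithmetic means and the guard are assumptions; the guard reflects that this scheme carries no convergence guarantee.

```python
def tailored_centroid_1d(means, variances, iters=100):
    """Fixed-point iteration for the Bhattacharyya centroid of 1D Gaussians."""
    n = len(means)
    mu_c = sum(means) / n          # initial guess (an assumption of this sketch)
    var_c = sum(variances) / n
    for _ in range(iters):
        # U_i = (Sigma_c + Sigma_i)^(-1), evaluated at the current variance.
        u = [1.0 / (var_c + v) for v in variances]
        # Mean update: U_i-weighted average of the input means.
        mu_c = sum(ui * m for ui, m in zip(u, means)) / sum(u)
        # Scalar version of the matrix A used in the variance update.
        a = sum(2.0 * ui - (ui * (mu_c - m)) ** 2 for ui, m in zip(u, means))
        if a <= 0.0:
            raise ArithmeticError("fixed-point update left the valid domain")
        var_c = n / a
    return mu_c, var_c

if __name__ == "__main__":
    # Two centered Gaussians with variances 1 and 4.
    print(tailored_centroid_1d([0.0, 0.0], [1.0, 4.0]))
```

For two centered Gaussians the stationary variance is the geometric mean of the input variances, which gives an easy sanity check of the iteration.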



Let us now compare the two methods (the generic Burbea-Rao
method and the tailored Gaussian method) for computing the
Bhattacharyya centroids of multivariate Gaussians.



D. Applications to mixture simplification in statistics 

Simplifying Gaussian mixtures is important in many appli- 
cations arising in signal processing ESI . Mixture simplifica- 
tion is also a crucial step when one wants to study the Rie- 
mannian geometry induced by the Rao distance with respect 
to the Fisher metric: the set of mixture models needs to have
the same number of components, so we simplify the source
mixtures to get a set of Gaussian mixtures with prescribed
size. We adapt the hierarchical clustering algorithm of Garcia
et al. [26] by replacing the symmetrized Bregman centroid
(namely, the Jeffreys-Bregman centroid) by the Bhattacharyya
centroid. We consider the task of color image segmentation
by learning a Gaussian mixture model for each image. Each 
image is represented as a set of 5D points (color RGB and 
position xy). 

The first experimental results depicted in Figure 3 demonstrate
the qualitative stability of the clustering performance.
In particular, the hierarchical clustering with respect to the
Bhattacharyya distance performs qualitatively much better on
the last colormap image (footnote 12).

12 See reference images and segmentation using Bregman centroids at
http://www.informationgeometry.org/MEF/


The second experiment focuses on characterizing the numerical
convergence of the generic Burbea-Rao method compared
to the tailored Gaussian method. Since we presented
two novel schemes to compute the Bhattacharyya
centroids of multivariate Gaussians, one wants to compare
them, both in terms of stability and accuracy. Whenever the
ratio of the Bhattacharyya distance energy functions of the two
estimated centroids deviates by more than 1%, we consider that one
of the two estimation methods is beaten (namely, the method
that gives the highest Bhattacharyya distance energy). Among the 760
centroids computed to generate Figures [5], 100% were correct
with the Burbea-Rao approach, while only 87% were correct
with the tailored multivariate Gaussian matrix optimization
method. The average number of iterations to reach the 1%
accuracy is 4.1 for the Burbea-Rao estimation algorithm, and
5.2 for the alternative method.

Thus we experimentally checked that the generic CCCP
iterative Burbea-Rao algorithm described for computing the
Bhattacharyya centroids always converges, and moreover beats
another ad hoc iterative method tailored for multivariate
Gaussians.
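For intuition about the generic scheme, the following sketch runs a concave-convex iteration on scalar natural parameters. It assumes the update $\theta_c^{(t+1)} = (\nabla F)^{-1}\left(\frac{1}{n}\sum_i \nabla F\left(\frac{\theta_c^{(t)} + \theta_i}{2}\right)\right)$ with uniform weights, which is our reading of the CCCP iteration referred to here; the function names and the Poisson example are ours. For the Poisson family ($F = \exp$, $\nabla F = \exp$, $(\nabla F)^{-1} = \log$) the fixed point satisfies $\sqrt{\lambda_c} = \frac{1}{n}\sum_i \sqrt{\lambda_i}$.

```python
import math

def burbea_rao_energy(c, thetas, F):
    """Average symmetric Burbea-Rao divergence from c to the points thetas."""
    return sum(0.5 * (F(c) + F(t)) - F(0.5 * (c + t)) for t in thetas) / len(thetas)

def cccp_centroid(thetas, grad_F, grad_F_inv, iters=60):
    """Concave-convex iteration for the Burbea-Rao centroid (scalar parameters)."""
    c = sum(thetas) / len(thetas)   # initial guess: arithmetic mean
    for _ in range(iters):
        avg = sum(grad_F(0.5 * (c + t)) for t in thetas) / len(thetas)
        c = grad_F_inv(avg)
    return c

# Poisson example with lambda in {1, 4, 9}: the centroid should have lambda_c = 4.
thetas = [math.log(1.0), math.log(4.0), math.log(9.0)]
theta_c = cccp_centroid(thetas, math.exp, math.log)
lambda_c = math.exp(theta_c)
```

In this example each iteration halves the error on the natural parameter, so a few dozen iterations already reach machine precision.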

VI. Concluding remarks 

In this paper, we have shown that the Bhattacharyya distance
for distributions of the same statistical exponential family
can be computed equivalently as a Burbea-Rao divergence
on the corresponding natural parameters. Those results extend
to skew Chernoff coefficients (and Amari $\alpha$-divergences)
and skew Bhattacharyya distances using the notion of skew
Burbea-Rao divergences. We proved that (skew) Burbea-Rao
centroids are unique, and can be efficiently estimated using
an iterative concave-convex procedure with guaranteed
convergence. We have shown that extremally skewed Burbea-Rao
divergences amount asymptotically to evaluating Bregman
divergences. This work emphasizes the attractiveness of
exponential families in Statistics. Indeed, it turns out that
many statistical distances can be evaluated in closed form on
them. For the sake of brevity, we have not mentioned the recent
$\beta$-divergences and $\gamma$-divergences [44], although their distances
on exponential families are again available in closed form.

Fig. 3. Color image segmentation results: (a) source images, (b) segmentation with k = 48 5D Gaussians, and (c) segmentation with k = 16 5D Gaussians.

The differential Riemannian geometry induced by the class
of such Jensen difference measures was studied by Burbea
and Rao [18], [19], who built quadratic differential metrics
on probability spaces using Jensen differences. The
Jensen-Shannon divergence is also an instance of a broad class of
divergences called the $f$-divergences. An $f$-divergence $I_f$ is a
statistical measure of dissimilarity defined by the functional
$I_f(p,q) = \int p(x) f\left(\frac{q(x)}{p(x)}\right) dx$. It turns out that the
Jensen-Shannon divergence is an $f$-divergence for the generator

$$f(x) = \frac{1}{2}\left((x+1)\log\frac{2}{x+1} + x\log x\right). \tag{79}$$

$f$-divergences preserve the information monotonicity [44], and
their differential geometry was studied by Vos [45]. However,
this Jensen-Shannon divergence is a very particular case of
Burbea-Rao divergences since the squared Euclidean distance
(another Burbea-Rao divergence) does not belong to the class
of $f$-divergences.
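The statement that the Jensen-Shannon divergence is an $f$-divergence can be checked on discrete distributions. The sketch below assumes the definition $JS(p,q) = \frac{1}{2}KL(p, m) + \frac{1}{2}KL(q, m)$ with $m = \frac{p+q}{2}$ and the generator $f(x) = \frac{1}{2}\left((x+1)\log\frac{2}{x+1} + x\log x\right)$; the function names are ours, and all probabilities are assumed strictly positive.

```python
import math

def kl(p, q):
    """Discrete Kullback-Leibler divergence (strictly positive entries assumed)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def jensen_shannon(p, q):
    """JS(p, q) = KL(p, m)/2 + KL(q, m)/2 with m the mid-distribution."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def f_js(x):
    """Generator of the Jensen-Shannon divergence seen as an f-divergence."""
    return 0.5 * ((x + 1.0) * math.log(2.0 / (x + 1.0)) + x * math.log(x))

def f_divergence(p, q, f):
    """I_f(p, q) = sum over x of p(x) f(q(x) / p(x))."""
    return sum(a * f(b / a) for a, b in zip(p, q))
```

For instance, `f_divergence(p, q, f_js)` and `jensen_shannon(p, q)` should coincide for `p = [0.2, 0.3, 0.5]` and `q = [0.4, 0.4, 0.2]`.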

Source code 

The generic Burbea-Rao barycenter estimation algorithm 
shall be released in the JMEF open source library: 

http://www.informationgeometry.org/MEF/ 
An applet visualizing the skew Burbea- 
Rao centroids ranging from the right-sided to 
left-sided Bregman centroids is available at: 
http://www.informationgeometry.org/BurbeaRao/ 